Weather Impact on NYC Inspections

Author

Robert Neagu

Published

December 15, 2025

Introduction

If you have ever looked closely at a NYC building entrance, you may have seen a posted inspection rating, such as an energy efficiency grade or a food safety grade. These are given by the DCWP (Department of Consumer and Worker Protection) and the DOHMH (Department of Health and Mental Hygiene), respectively, following an inspection. Near Baruch College, I noticed that many buildings have a low energy rating while nearby restaurants mostly have A grades. This raised the question of why the two inspection types grade so differently. Perhaps another factor influences the grade, such as the weather? Maybe the inspector came on a rainy day in a bad mood and graded more harshly? With these thoughts in mind, I wanted to answer the overarching question:

“How much do weather conditions impact the results of inspections of establishments in New York City?”

This question may seem odd at first, but consider its implications. A weather-driven mood bias could falsely label a business as failing standards that it actually meets. Temperature could damage interiors, for example by cracking walls, and make food preservation more difficult. Precipitation and snowfall strain resource management, making it harder for a business to look its best. Factors like these are what I want to explore with the overarching question.

This report explores the relationship between weather conditions and inspection results in NYC by answering the following supporting questions (SQs):

  • What Weather Conditions Show the Lowest and Highest Inspection Results Over Time? Are There Areas Favored Over Others?
  • Which Months of the Year are Best for Food Inspection?
  • Is there any Relationship With What Type of Food Was Being Inspected?
  • What Relationship Exists For The Inspector After Completing Each Subsequent Inspection? Does a Difference Exist between Restaurant Inspectors and Worker Inspectors?

Each SQ is treated as a separate analysis. The analyses come together in the conclusion to answer the overarching question.

Prior Literature

To further motivate the overarching question, I looked at past research on the relationship between weather and inspections. One study published on ScienceDirect, Hot Weather Impacts on New York City Restaurant Food Safety Violations and Operations, examined whether high temperatures increase food safety risks in NYC restaurants.

The study found that nearly all restaurants took no preventative action to protect perishable foods. This likely caused food safety violations to increase on high-temperature days. The authors concluded that such days, along with power outages that affect food preservation, “likely increase food safety risks in restaurants” and suggested guidelines for restaurants to mitigate these risks.

Comparing this prior research to this report, there are similarities and differences to note. Both look at the same location, which makes the findings easier to compare. However, the prior research took a narrower approach than the overarching question, looking only at restaurants and potential food risks on high-temperature days. In addition, its data covered 2011 to 2015, while this report looks at 2023 to 2025. Looking at data nearly a decade apart is a good way to check whether the claim about food risk still holds, especially given continued global warming.

Data Sources

Four data sources were used in this study: two inspection data sets, weather data, and a map of NYC council districts. Each is explained in its own subsection, along with how it was obtained. All sources were obtained in a similar way, by using an API to download the necessary files.

DCWP Inspections

Found on NYC Open Data, this data set records the details of each DCWP inspection of a business. It covers many inspection types, from parking violations to checking for the official permits required to sell specific items. It contains about 210,000 rows for inspections from July 2023 through October 2025. Because the data is pulled through an API, the number of rows will grow if the file is downloaded at a later time.

Given the large number of columns, only a selection was used in the analysis. The most notable are:

  • inspection_type: What type of inspection was performed
  • inspection_status: Whether a violation was issued
  • zip_code: Useful for location bias
  • latitude & longitude: Allows visual analysis through the NYC map

Despite having nearly 30 columns, this data set has some limitations. For example, having only three years of data may skew results, as 2023 to 2025 were full of uncertainty after the COVID-19 pandemic. In addition, inspection_status is best interpreted as a yes-or-no column, so the specifics of each violation are lost.

DOHMH New York City Restaurant Inspection Results

Found on NYC Open Data, this data set records specific details about the grade of a restaurant or cafe. These include a description of the cuisine, when the inspection took place, the violation details, the grade received, and the type of inspection. It contains about 293,000 rows for inspections from August 2018 through December 7, 2025. The data is also obtained through an API, so new entries will appear if it is downloaded at a later time.

Given the large number of columns, only a selection was used in the analysis. The most notable are:

  • inspection_date: Date when the inspection was performed. Dates of 1/1/1900 mean no inspection was recorded.
  • grade: Letter grade of A, B, or C. Z and P indicate a pending grade.
  • location: Shows where the business is using latitude and longitude. Provides easy implementation to the NYC map.

This data set has no noticeable limitations of its own, apart from having to remove entries with an empty grade, which is also determined by the date. Also, like the DCWP inspections, it contains no weather information beyond the inspection date.
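The placeholder date and pending grades described above can be handled in a few lines. The following is a minimal sketch in base R using made-up rows; whether to drop pending grades (Z and P) is an assumption for illustration and is not necessarily what this report's pipeline does.

```r
# Toy rows standing in for the DOHMH export (values are made up for illustration)
inspections <- data.frame(
  camis = c("1001", "1002", "1003", "1004", "1005"),
  inspection_date = as.Date(c("2024-03-04", "1900-01-01", "2024-06-11",
                              "2024-07-19", "2025-01-08")),
  grade = c("A", NA, "Z", "b ", "P"),
  stringsAsFactors = FALSE
)

# Treat the 1900-01-01 placeholder as "no inspection recorded"
not_inspected <- inspections$inspection_date == as.Date("1900-01-01")

# Standardize the grade text, then flag rows with a final letter grade
grade_clean <- toupper(trimws(inspections$grade))
has_final_grade <- !is.na(grade_clean) & grade_clean %in% c("A", "B", "C")

# Keep inspected rows with a final grade (assumption: pending Z/P are dropped)
keep <- !not_inspected & has_final_grade
graded <- inspections[keep, ]
graded$grade <- grade_clean[keep]
print(graded)
```

Trimming and upper-casing before filtering keeps entries such as "b " from being silently discarded.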

Open Meteo Weather Data

Open Meteo is a weather API that offers flexibility in choosing the desired data from its archive of recorded weather. Many weather variables can be included. Because the DCWP inspections cover a narrower time span, daily weather from January 2010 through December 2025 was used so that the weather data covers both inspection data sets within a reasonable time span.

The following columns were used for daily weather in NYC:

  • weather_code (WMO code): Code summarizing the weather for the day. Higher values generally indicate greater intensity.
  • Temperature: Recorded in Fahrenheit, in 3 separate columns for the average, maximum, and minimum temperature of the day.
  • wind_speed_10m_max (mph): Maximum wind speed recorded during the day in miles per hour (mph).
  • precipitation_sum (mm): Total precipitation recorded during the day in millimeters (mm).
  • rain_sum (mm): Total rain recorded during the day in millimeters.
  • snowfall_sum (cm): Total snowfall accumulated during the day in centimeters.

In the context of this report, the main limitation is that weather is not available for the specific inspection site and time, since it is summarized daily at a single point. While an hourly option is available, the inspection data does not record what time an inspection took place.
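Because only the calendar date of an inspection is known, weather can only be attached by matching on the date. A minimal sketch of that join in base R follows; the column names and values are hypothetical stand-ins, not the actual Open Meteo or inspection headers.

```r
# Made-up daily weather for a single point in NYC (one row per day)
weather <- data.frame(
  time = as.Date(c("2024-01-05", "2024-01-06", "2024-01-07")),
  temperature_mean_f = c(34.1, 29.8, 31.5),
  precipitation_mm = c(0.0, 12.4, 3.1)
)

# Made-up inspections; only the date is known, not the hour or exact site weather
inspections <- data.frame(
  id = c("a", "b", "c", "d"),
  inspection_date = as.Date(c("2024-01-06", "2024-01-06",
                              "2024-01-07", "2024-01-09"))
)

# Left join: keep every inspection and attach that day's weather when available
joined <- merge(inspections, weather,
                by.x = "inspection_date", by.y = "time",
                all.x = TRUE)
joined <- joined[order(joined$id), ]
print(joined)
```

Every inspection on the same day receives identical weather values, and an inspection on a day outside the weather file ends up with NA, which is the limitation described above.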

NYC City Council Districts

Sourced from the NYC Department of City Planning, a shapefile (.shp) is used to visualize the council districts across NYC, with businesses placed using latitude and longitude from the inspection data sets. This provides an easy way to group businesses together and see density differences across NYC. No columns are provided and version 25C was used.

The biggest limitation of this approach is that districts are not very specific about location and overlap with zip codes. This is why latitude and longitude are important columns to use.

Answering Supporting Questions & Analysis

With the data sources described, each Supporting Question (SQ) gets its own section explaining how it supports the overarching question. Analysis is presented either visually or through specific values.

What Weather Conditions Show the Lowest and Highest Inspection Results Over Time? Are There Areas Favored Over Others?

This is the starting SQ, providing baseline information about what to expect in the later questions. DCWP inspections were used for this question because they cover a larger variety of inspection types than DOHMH. Weather code was used to generalize weather conditions, and data was grouped by month to collect total violations. The results are:

  • July 2023 had the lowest violation count of 1 with a weather code of 51 (light drizzle)
  • October 2024 had the highest violation count of 2,783 with a weather code of 3 (overcast)

These results suggest that extreme summer months are likely to have few violations, while a typical autumn month may show a high violation count. A visual was also made to check for possible favoritism by council district:

The visual shows the weather code from July 2023 through October 2025 alongside the violation count in each district per month. Most of the time, Lower and Midtown Manhattan show the most violations regardless of weather code. However, there are months where the violation count rose sharply across many districts.

Therefore, weather code has little to no impact on DCWP inspections. Only extreme heat appears able to increase violations, and such cases are unlikely. Also, Lower and Midtown Manhattan show the greatest volatility in violations by weather and day. Areas of Queens and Brooklyn adjacent to Manhattan, along with Staten Island, also see some volatility.

Limitations of this approach include looking only at monthly data instead of daily data. Daily data would be more specific, but patterns would be harder to identify. Monthly data may not be fully representative, but it provides a close enough summary to be valid.
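The monthly grouping used in this section can be sketched in base R as follows. The rows are made up, and the status labels ("Violation Issued" and "No Violation Issued") are hypothetical placeholders for whatever values inspection_status actually contains.

```r
# Made-up DCWP-style rows: one row per inspection
dcwp <- data.frame(
  inspection_date = as.Date(c("2024-09-03", "2024-09-17", "2024-10-02",
                              "2024-10-09", "2024-10-21", "2024-11-05")),
  # Hypothetical status labels used only for this illustration
  inspection_status = c("Violation Issued", "No Violation Issued",
                        "Violation Issued", "Violation Issued",
                        "Violation Issued", "No Violation Issued"),
  stringsAsFactors = FALSE
)

# Reduce each date to a year-month key and flag violations as TRUE/FALSE
dcwp$month <- format(dcwp$inspection_date, "%Y-%m")
dcwp$is_violation <- dcwp$inspection_status == "Violation Issued"

# Count violations per month
monthly <- aggregate(is_violation ~ month, data = dcwp, FUN = sum)
names(monthly)[2] <- "violations"

# Months with the fewest and the most violations
lowest <- monthly[which.min(monthly$violations), ]
highest <- monthly[which.max(monthly$violations), ]
print(monthly)
```

Note that a month only partially covered by the data (for example, the first or last month of the download) will show a low count for reasons unrelated to weather, so the extremes deserve a second look before being interpreted.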

Which Months of the Year are Best for Food Inspection?

Here, DOHMH data is used to determine which months are best for a food inspection, as it contains only food-based businesses. The average daily temperature is used as the weather factor, given that weather code was not informative in the previous question. This connects to the overarching question at the restaurant level and examines another weather metric and its impact on inspections.

Monthly data is used as a summary of daily weather, covering 2022 to 2025. The percentage of A-grade inspections is also plotted, showing the overall chance of getting an A grade. The result is the interactive visualization shown below.

Loading Libraries
#Obtaining data and performing SQL like commands
library(sf)
library(tidyverse)
library(httr2)

#Data injection
library(glue)
library(readxl)
library(tidycensus)

#Display datatables
library(DT)

#Visualization libraries
library(ggplot2)
library(plotly)
library(viridis)
library(gganimate)
library(scales)

#QOL
library(tidyr)
library(lubridate)
library(readr)
Interactive Temperature Chart of Grade A
##Downloading NYC Restaurant Data
restaurant_data_path <- "./data/Final/DOHMH_NYC_Restaurants.csv"

if(!file.exists(restaurant_data_path)){
  # Download csv file from NYC Open Data API endpoint. Set limit to 300k to download all rows
  download.file(url = "https://data.cityofnewyork.us/resource/43nn-pn8j.csv?$limit=300000",
                destfile = restaurant_data_path, mode = "wb")
}
#Read DOHMH Restaurant data
restaurant_data <- read_csv(restaurant_data_path)


### Downloading weather data via API:
##Downloading Weather Data
weather_path <- "./data/Final/weather_data.csv"

# Download only if the file does not already exist
if(!file.exists(weather_path)){
    # Save directly to weather_path so that read_csv() below can find the file
    download.file(url = "https://archive-api.open-meteo.com/v1/archive?latitude=40.7143&longitude=-74.006&start_date=2010-01-01&end_date=2025-12-05&daily=weather_code,temperature_2m_mean,temperature_2m_max,temperature_2m_min,wind_speed_10m_max,daylight_duration,precipitation_sum,rain_sum,snowfall_sum&timezone=auto&temperature_unit=fahrenheit&wind_speed_unit=mph&format=csv",
                  destfile = weather_path, mode = "wb")
}
#Read the weather data, skipping the metadata lines at the top of the file
weather_data <- read_csv(weather_path, skip=2)

#Data Cleaning (restaurants)
#Select relevant columns
clean_restaurant_data <- restaurant_data |>
  select(camis, dba, boro, inspection_date, grade, grade_date, score, cuisine_description, council_district, location)

#Keep rows graded on or after 2022-01-01
#(as.Date() keeps the comparison Date-to-Date even if grade_date was parsed as a datetime)
clean_restaurant_data <- clean_restaurant_data |>
  filter(as.Date(grade_date) >= as.Date('2022-01-01'))
#Filter out rows with missing data in key columns
clean_restaurant_data <- clean_restaurant_data |>
  filter(!is.na(location) & !is.na(inspection_date) & !is.na(grade) & !is.na(score) & !is.na(cuisine_description))

## Creating the graph
# Ensure dates and create year/month
restaurant_isA <- clean_restaurant_data |>
    mutate(
        inspection_date = as.Date(inspection_date),
        grade_date = as.Date(grade_date),
        year = year(inspection_date),
        month = month(inspection_date, label = TRUE, abbr = TRUE),
        grade_clean = toupper(trimws(grade)),
        is_A = if_else(grade_clean == "A", 1L, 0L)
    )

# Summarize percent A by month and year
inspections_summary <- restaurant_isA |>
    group_by(month, year) |>
    summarise(
        n_total = n(),
        n_A = sum(is_A, na.rm = TRUE),
        pct_A = 100 * n_A / n_total,
        .groups = "drop"
    )

# Try to extract a temperature column from weather_data (common names)
temp_cols <- names(weather_data)[grepl("temp|temperature|t2m|temperature_2m", names(weather_data), ignore.case = TRUE)]

show_temp <- FALSE
if (length(temp_cols) > 0) {
    tcol <- temp_cols[1]
    weather_monthly <- weather_data |>
        mutate(time = as.Date(time)) |>
        mutate(
            year = year(time),
            month = month(time, label = TRUE, abbr = TRUE)
        ) |>
        group_by(year, month) |>
        summarise(mean_temp = mean(.data[[tcol]], na.rm = TRUE), .groups = "drop")

    inspections_summary <- inspections_summary |>
        left_join(weather_monthly, by = c("year", "month"))

    min_pct <- min(inspections_summary$pct_A, na.rm = TRUE)
    max_pct <- max(inspections_summary$pct_A, na.rm = TRUE)
    min_temp <- min(inspections_summary$mean_temp, na.rm = TRUE)
    max_temp <- max(inspections_summary$mean_temp, na.rm = TRUE)

    if (is.finite(min_temp) && (max_temp - min_temp) > 0) {
        inspections_summary <- inspections_summary |>
            mutate(mean_temp_scaled = (mean_temp - min_temp) / (max_temp - min_temp) * (max_pct - min_pct) + min_pct)
        show_temp <- TRUE
    }
}

# Order months Jan..Dec
inspections_summary$month <- factor(inspections_summary$month, levels = month(1:12, label = TRUE, abbr = TRUE))

# Static plot but add text for hover
p <- ggplot(inspections_summary, aes(x = month, y = pct_A, fill = factor(year))) +
    geom_col(aes(text = paste0("Year: ", year, "<br>Month: ", month, "<br>Percent A: ", round(pct_A, 1), "%<br>N: ", n_total)),
                     position = position_dodge(width = 0.9), color = "black", width = 0.8) +
    scale_fill_brewer(palette = "Set2", name = "Year") +
    labs(x = "Month", y = "Percent Grade A", title = "Percent Grade A by Month (2022-2025)") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 0, vjust = 0.5))

if (show_temp) {
    p <- p +
        geom_point(aes(y = mean_temp_scaled, group = factor(year), color = factor(year),
                                     text = paste0("Year: ", year, "<br>Month: ", month, "<br>Mean temp: ", round(mean_temp, 1))),
                             position = position_dodge(width = 0.9), size = 2) +
        scale_color_brewer(palette = "Dark2", name = "Year (temp)") +
        guides(color = guide_legend(order = 2), fill = guide_legend(order = 1)) +
        labs(title = paste0("Percent Grade A by Month (2022-2025)\nAverage Temperature (°F) shown as points scaled to the %A range"))
}

# Convert to interactive plotly object with hover showing the text field
p <- ggplotly(p, tooltip = "text")
p

You can hover over the bars and points of the visualization to see more details. Starting with the points, there is an expected correlation between temperature and month. This is most apparent during the summer, where temperatures across years are very close to each other. In other months, temperature fluctuates more from year to year. The bars reveal high volatility in the percentage of A grades for each month, making month seem like an unreliable predictor.
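The temperature points in the chart are not plotted in degrees. They are min-max rescaled onto the range of the A-grade percentages so that both series share one axis, as done for mean_temp_scaled in the code above. A minimal sketch of that rescaling, with a hypothetical helper name and made-up values:

```r
# Hypothetical helper: linearly map x onto the interval [new_min, new_max]
rescale_to_range <- function(x, new_min, new_max) {
  old_min <- min(x, na.rm = TRUE)
  old_max <- max(x, na.rm = TRUE)
  # A constant or empty input cannot be rescaled
  if (!is.finite(old_min) || old_max == old_min) {
    return(rep(NA_real_, length(x)))
  }
  (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
}

# Made-up monthly values for illustration
mean_temp <- c(35, 50, 65, 80)   # degrees Fahrenheit
pct_A     <- c(70, 74, 82, 90)   # percent of inspections graded A

scaled <- rescale_to_range(mean_temp, min(pct_A), max(pct_A))
print(scaled)
```

The coldest month lands at the lowest bar height and the warmest at the highest, so a point's vertical position shows relative temperature only. The actual temperature has to be read from the hover text.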

Based on the analysis, there is a weak relationship in which inspection risk increases as temperature rises through the months. Despite this, the data shows May and June are the best months for a food inspection, given their consistently high A-grade percentage.

A limitation of this approach is that not all reliable years of the restaurant data were used. More rows would allow a longer trend to be seen, possibly with clearer results. Also, interpretation of the chart is subjective, given the high volatility and the steep decline in the A-grade percentage in 2025.

Is there any Relationship With What Type of Food Was Being Inspected?

Since this SQ concerns food, the same DOHMH data is used. To support the overarching question, the weather variables examined are precipitation and snowfall. Other weather variables are omitted, as they showed little to no relationship in prior SQs. Because raw weather values can be difficult to interpret, the data was normalized to one value per day for easier analysis. This is then tied to cuisine, which represents food type, to see whether any impact exists.

Precipitation and snowfall results are shown in separate visualizations. Daily data is now used, as the goal is to find the top cuisines based on how many have an A grade. Years 2010 to 2025 are used. Below is the precipitation visualization.

Precipitation Visualization Code
#Re-clean restaurant data to include years starting 2010
#Select relevant columns
clean_restaurant_data <- restaurant_data |>
  select(camis, dba, boro, inspection_date, grade, grade_date, score, cuisine_description, council_district, location)

#Filter data starting in year 2010.
#(as.Date() keeps the comparison Date-to-Date even if grade_date was parsed as a datetime)
clean_restaurant_data <- clean_restaurant_data |>
  filter(as.Date(grade_date) >= as.Date('2010-01-01'))
#Filter out rows with missing data in key columns
clean_restaurant_data <- clean_restaurant_data |>
  filter(!is.na(location) & !is.na(inspection_date) & !is.na(grade) & !is.na(score) & !is.na(cuisine_description))

# ensure date columns are Date
nyc_restaurant_inspections <- clean_restaurant_data |>
  mutate(grade_date = as.Date(grade_date),
         inspection_date = as.Date(inspection_date))

wd_names <- names(weather_data)
date_col  <- wd_names[grepl("date|time", wd_names, ignore.case = TRUE)][1]
precip_col <- wd_names[grepl("precip", wd_names, ignore.case = TRUE)][1]

# normalize weather to daily precipitation using detected columns
weather_daily <- weather_data |>
  rename(weather_date = !!rlang::sym(date_col),
         precipitation = !!rlang::sym(precip_col)) |>
  mutate(weather_date = as.Date(weather_date),
         precipitation = as.numeric(precipitation)) |>
  group_by(weather_date) |>
  summarize(daily_precip = sum(precipitation, na.rm = TRUE), .groups = "drop")

# restrict inspections to the full date range present in weather_data (all years)
date_min <- min(weather_daily$weather_date, na.rm = TRUE)
date_max <- max(weather_daily$weather_date, na.rm = TRUE)

inspections_in_range <- nyc_restaurant_inspections |>
  filter(grade_date >= date_min & grade_date <= date_max)

inspections_with_precip <- inspections_in_range |>
  left_join(weather_daily, by = c("grade_date" = "weather_date")) |>
  filter(!is.na(daily_precip))

# compute cuisine counts and select common cuisines (top 20)
cuisine_counts <- inspections_with_precip |> count(cuisine_description, sort = TRUE)
top_common_cuisines <- cuisine_counts |> slice_head(n = 20) |> pull(cuisine_description)

# compute mean precipitation per cuisine across all years and take top 10
top10_by_precip <- inspections_with_precip |>
  filter(cuisine_description %in% top_common_cuisines) |>
  group_by(cuisine_description) |>
  summarize(mean_precip = mean(daily_precip, na.rm = TRUE),
            median_precip = median(daily_precip, na.rm = TRUE),
            inspections = n(),
            .groups = "drop") |>
  arrange(desc(mean_precip)) |>
  slice_head(n = 10)

ggplot(top10_by_precip, aes(x = reorder(cuisine_description, mean_precip), y = mean_precip)) +
  geom_col(fill = "blue") +
  coord_flip() +
  labs(title = "Top 10 Cuisines by Mean Daily Precipitation on Inspection Days",
       x = "Cuisine",
       y = "Mean Daily Precipitation (in mm, normalized)") +
  theme_bw()

Despite normalizing the data, the top 10 cuisines are very close in value to each other. This is an indication that precipitation has nearly no effect on the inspection result. Let’s see if snowfall shows any difference, this time including more of the top cuisines.

Snowfall Visualization Code
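# classify inspections as A vs Not A
# (dates are converted so grade_date can be joined to the weather dates below)
nyc_restaurant_inspections <- clean_restaurant_data |>
  mutate(grade_date = as.Date(grade_date),
         inspection_date = as.Date(inspection_date),
         grade_A = if_else(toupper(trimws(grade)) == "A", "A", "Not A"))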

# auto-detect weather date and snow columns
wd_names <- names(weather_data)
date_col  <- wd_names[grepl("date|time", wd_names, ignore.case = TRUE)][1]
snow_col  <- wd_names[grepl("snow", wd_names, ignore.case = TRUE)][1]

if (is.na(date_col) || is.na(snow_col)) {
  stop("Could not auto-detect weather date/snowfall columns. Available cols: ", paste(wd_names, collapse = ", "))
}

# normalize weather to daily snowfall using detected columns
weather_daily_snow <- weather_data |>
  rename(weather_date = !!rlang::sym(date_col),
         snowfall = !!rlang::sym(snow_col)) |>
  mutate(weather_date = as.Date(weather_date),
         snowfall = as.numeric(snowfall)) |>
  group_by(weather_date) |>
  summarize(daily_snow = sum(snowfall, na.rm = TRUE), .groups = "drop")

# restrict inspections to the full date range present in weather_data (all years)
date_min <- min(weather_daily_snow$weather_date, na.rm = TRUE)
date_max <- max(weather_daily_snow$weather_date, na.rm = TRUE)

inspections_in_range <- nyc_restaurant_inspections |>
  filter(grade_date >= date_min & grade_date <= date_max)

inspections_with_snow <- inspections_in_range |>
  left_join(weather_daily_snow, by = c("grade_date" = "weather_date")) |>
  filter(!is.na(daily_snow))

# compute cuisine counts and select common cuisines (top 20)
cuisine_counts <- inspections_with_snow |> count(cuisine_description, sort = TRUE)
top_common_cuisines <- cuisine_counts |> slice_head(n = 20) |> pull(cuisine_description)

# compute mean snowfall per cuisine split by A vs Not A
cuisine_by_grade_snow <- inspections_with_snow |>
  filter(cuisine_description %in% top_common_cuisines) |>
  group_by(cuisine_description, grade_A) |>
  summarize(mean_snow = mean(daily_snow, na.rm = TRUE),
            median_snow = median(daily_snow, na.rm = TRUE),
            inspections = n(),
            .groups = "drop")

# Bars for A only, descending order (largest at top), no borders
plot_data <- cuisine_by_grade_snow |>
  filter(grade_A == "A") |>
  arrange(desc(mean_snow)) |>
  # set factor levels so the largest mean_snow appears at the top when flipped
  mutate(cuisine_description = factor(cuisine_description, levels = rev(cuisine_description)))

#Plot the data
ggplot(plot_data, aes(x = cuisine_description, y = mean_snow)) +
  geom_col(fill = "#1f78b4", color = NA, width = 0.75) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  labs(title = "Top Cuisines Based on Snowfall (Grade A only)",
       x = "Cuisine",
       y = "Mean Daily Snowfall (in cm)") +
  theme_bw(base_size = 12) +
  theme(axis.text.y = element_text(face = "bold"),
        plot.title = element_text(face = "bold"))

Snowfall seems to have a clearer relationship with the cuisine that was inspected. It seems that cuisines tied to a specific culture tend to have more grade A inspections. These include Spanish, Italian, Hispanic, and Asian cuisines.

The analysis suggests that a culture-based relationship exists between increased snowfall and higher grades. Precipitation and rain show no relationship with inspection results.

A limitation of this approach is that weather is not available for each specific location, as precipitation in one borough can differ from another. The restaurant data would have to include the weather at each inspection site to resolve this limitation.
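As a complementary check, one could compare the share of A grades on wet days against dry days directly. This is not the method used in this report; it is a minimal sketch in base R on made-up rows, and the 1 mm threshold for a "wet" day is an assumption chosen only for illustration.

```r
# Made-up inspections already joined to daily precipitation (mm)
rows <- data.frame(
  grade = c("A", "A", "B", "A", "C", "A", "A", "B"),
  daily_precip = c(0, 0, 0, 0, 5.2, 11.0, 0.4, 2.3),
  stringsAsFactors = FALSE
)

rows$is_A <- rows$grade == "A"
# Assumption: a day counts as "wet" at 1 mm of precipitation or more
rows$wet_day <- rows$daily_precip >= 1

# Share of A grades among dry-day ("FALSE") and wet-day ("TRUE") inspections
share_A <- tapply(rows$is_A, rows$wet_day, mean)
print(share_A)
```

If the two shares are close on the real data, that would support the reading that precipitation has little effect. A large gap would suggest the per-cuisine averages are hiding a pattern.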

What Relationship Exists For The Inspector After Completing Each Subsequent Inspection? Does a Difference Exist between Restaurant Inspectors and Worker Inspectors?

Both DCWP and DOHMH data are used in this section, with the goal of comparing consistency between the two types of inspectors. Consistency here means how often an establishment's result matches the result of its previous inspection. While weather data is not included, the purpose of this SQ is to determine whether the type of inspection matters for future analyses. Years 2023 to 2025 are used, as the DCWP data covers a narrower span than DOHMH. Consistency is summarized below:

  • DCWP inspections have a consistency rate of ~36%.
  • DOHMH inspections have a consistency rate of ~94%.

The summary shows that DCWP inspection outcomes are harder to predict, while DOHMH inspectors give the same grade almost every time. This can be shown further with a visualization comparing consistency and number of establishments, shown below.

Consistency Graph Code
# Read DCWP data (it has not been loaded yet)
##Downloading DCWP data
DCWP_path <- "./data/Final/DCWP_Inspection_Data.csv"
if(!file.exists(DCWP_path)){
  # Download csv file from NYC Open Data API endpoint. Set limit to 300k to download all rows
  download.file(url = "https://data.cityofnewyork.us/resource/jzhd-m6uv.csv?$limit=300000",
                destfile = DCWP_path, mode = "wb")
}
#Read DCWP data
dcwp_data <- read_csv(DCWP_path)

# Combine DOHMH and DCWP inspections and compute consistency for 2023-2025
# (separate consistency computed per source so we can compare)

# Prepare DOHMH (clean_restaurant_data) - ensure dates and key fields are correct
dohm_inspections <- clean_restaurant_data |>
  mutate(inspection_date = as.Date(inspection_date),
         camis = as.character(camis),
         grade = as.character(grade)) |>
  select(id = camis, date = inspection_date, result = grade, score) |>
  mutate(id = as.character(id),
         date = as.Date(date),
         result = as.character(result),
         source = "DOHMH")

# Data Cleaning (DCWP)
# Select relevant columns (date_of_occurrence is renamed to inspection_date)
clean_DCWP <- dcwp_data |>
  select(certificate_of_inspection, business_unique_id, dcwp_license_number, inspection_type, inspection_status, zip_code, latitude, longitude, inspection_date = date_of_occurrence)
# Remove duplicates
clean_DCWP <- clean_DCWP |> distinct()
# Remove rows with missing or invalid coordinates
clean_DCWP <- clean_DCWP |> 
  filter(!is.na(longitude) & !is.na(latitude) & longitude != 0 & latitude != 0)

# Prepare DCWP (clean_DCWP) - use Business Unique ID as id and Inspection Status as result
dcwp_inspections <- clean_DCWP |>
  mutate(date = as.Date(inspection_date)) |>
  select(id = business_unique_id, date, result = inspection_status) |>
  mutate(id = as.character(id),
         date = as.Date(date),
         result = as.character(result),
         source = "DCWP")

# Keep only 2023-2025 and non-missing results, then bind rows
inspections_all <- bind_rows(dohm_inspections, dcwp_inspections) |>
  filter(!is.na(id) & !is.na(date) & !is.na(result)) |>
  filter(date >= as.Date("2023-01-01") & date <= as.Date("2025-12-31")) |>
  arrange(source, id, date)

# Compute per-establishment consistency (previous result same as current) within each source
inspection_consistency <- inspections_all |>
  group_by(source, id) |>
  arrange(date, .by_group = TRUE) |>
  mutate(previous_result = lag(result),
         consistency = ifelse(!is.na(previous_result) & result == previous_result, 1, 0)) |>
  filter(!is.na(previous_result)) |>
  ungroup()

# Per-establishment consistency rates
per_establishment <- inspection_consistency |>
  group_by(source, id) |>
  summarize(consistency_rate = mean(consistency, na.rm = TRUE), .groups = "drop")

# Histogram with two bars per bin (one per source) using position = "dodge"
consistency_plot <- ggplot(per_establishment, aes(x = consistency_rate, fill = source)) +
  geom_histogram(binwidth = 0.1, position = position_dodge(width = 0.1), color = "white", alpha = 0.8, boundary = 0) +
  scale_x_continuous(limits = c(0,1), breaks = seq(0,1,0.1)) +
  scale_fill_manual(values = c("DOHMH" = "#1f77b4", "DCWP" = "#ff7f0e")) +
  labs(title = "Inspection Consistency Rates by Source (2023-2025)",
       x = "Consistency Rate",
       y = "Number of Establishments",
       fill = "Source") +
  theme_minimal()

print(consistency_plot)

DCWP consistency rates are noticeably spread throughout the graph. The most common values are 0 and 1, but several thousand DCWP establishments fall in between. Meanwhile, almost all DOHMH establishments have a consistency rate of 1. Their lowest values start at 0.5, and counts keep growing until a sharp spike at 1.

This analysis shows that variation among DCWP inspections makes it difficult to assess inspection risk from one inspection to the next. Meanwhile, DOHMH inspectors are very likely to give the same grade. The gap in consistency shows a clear difference between restaurant and worker inspectors.

A limitation of this approach is the limited data, as DCWP inspections are only recorded since 2023. More years of data could change the consistency rate and the final analysis.
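To make the consistency rate concrete, here is a small worked example in base R. It follows the same idea as the code above (compare each result with the establishment's previous result), but the rows and ids are made up for illustration.

```r
# Made-up inspection history for three establishments
inspections <- data.frame(
  id = c("r1", "r1", "r1", "r1", "r2", "r2", "r3"),
  date = as.Date(c("2023-02-01", "2023-09-01", "2024-03-01", "2025-01-15",
                   "2023-05-10", "2024-05-10", "2024-08-01")),
  result = c("A", "A", "B", "B", "A", "A", "A"),
  stringsAsFactors = FALSE
)

# Order by establishment and date so "previous" means the prior inspection
inspections <- inspections[order(inspections$id, inspections$date), ]

# Previous result within each establishment (NA for its first inspection)
inspections$previous <- ave(inspections$result, inspections$id,
                            FUN = function(r) c(NA_character_, head(r, -1)))

# Keep only rows that have a previous inspection to compare against
pairs <- inspections[!is.na(inspections$previous), ]
pairs$same <- pairs$result == pairs$previous

# Consistency rate per establishment: share of repeats of the previous result
per_establishment <- aggregate(same ~ id, data = pairs, FUN = mean)
names(per_establishment)[2] <- "consistency_rate"
print(per_establishment)
```

Establishment r1 repeats its previous result in two of three comparisons, r2 in its only comparison, and r3 drops out because it has a single inspection. Establishments with few inspections can therefore only take a handful of rate values, which helps explain the clustering at 0, 0.5, and 1 in the histogram.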

Conclusion and Next Steps

Based on the analysis performed for each question, the key takeaways are:

  • Density of buildings in an area likely increases inspection risk
  • Snowfall can improve the chances of a good inspection rating, especially if the business is culture oriented
  • DCWP inspection outcomes are harder to predict than DOHMH outcomes

Future work can incorporate additional inspection data, such as the weather at the time of the inspection. It is also suggested to examine the relationship between inspection consistency and favored areas, to determine whether consistency depends on the area.